Notes: In this lesson, Solomon will be using a linear regression model to predict diamond price using other variables in the diamonds dataset. If you are not familiar with linear regression, you may want to take a break and go through Lesson 15 of Udacity’s Intro to Inferential Statistics course (https://classroom.udacity.com/courses/ud201), which covers linear regression. When you’re done, you’ll be ready to come back and apply your new knowledge to the diamonds dataset! ***
Let’s start by examining two variables in the data set. The scatterplot is a powerful tool to help you understand the relationship between two continuous variables.
We can quickly see if the relationship is linear or not. In this case, we can use a variety of diamond characteristics to help us figure out whether the price advertised for any given diamond is reasonable or a rip-off.
Let’s consider the price of a diamond and it’s carat weight. Create a scatterplot of price (y) vs carat weight (x).
Limit the x-axis and y-axis to omit the top 1% of values.
Hint: Use the quantile() function inside of xlim and ylim to omit the top 1% of values for each variable.
library(ggplot2)
ggplot(aes(x = carat, y = price), data = subset(diamonds, price < quantile(price, probs = .99) & carat < quantile(carat, probs = .99))) +
geom_point(fill = I('#F79420'), color = I('black'), shape = 21)
qplot(data = diamonds, x = carat, y = price,
xlim = c(0, quantile(diamonds$carat, 0.99)),
ylim = c(0, quantile(diamonds$price, 0.99))) +
geom_point(fill = I('#F79420'), color = I('black'), shape = 21)
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 926 rows containing missing values (geom_point).
ggplot(aes(x = carat, y = price), data = diamonds) +
scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) +
scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99))) +
geom_point(fill = I('#F79420'), color = I('black'), shape = 21)
## Warning: Removed 926 rows containing missing values (geom_point).
Response:
ggplot(aes(x = carat, y = price), data = diamonds) +
scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) +
scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99))) +
geom_point(fill = I('#F79420'), color = I('black'), shape = 21, alpha = 1/4) +
stat_smooth(method = 'lm')
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).
Notes:
Notes: The Counte of Monte Cristo (https://www.youtube.com/watch?v=PJyRNeCErNE) and Knocked Up (https://www.youtube.com/watch?v=Be3sZKmfb9g) are two movies that have proposals without engagement rings.
These are the only two movies that came to mind…after A LOT of time. Can you think of any others? If so, share it in the discussions! ***
Notes: You can click on the packages tab in RStudio to determine which packages have been installed.
You may receive a message when installing the new packages. If so, click cancel, clear your workspace, and try installing the packages again.
In this video, Solomon works with the plyr package. We worked with the dplyr package to manipulate data frames and to create new ones throughout the course. dplyr is the latest version of plyr that is specifically for working with data frames.
Similarly, we worked with the reshape2 package, which is the newest version of the reshape package.
ggpairs output (https://s3.amazonaws.com/udacity-hosted-downloads/ud651/ggpairs_landscape.pdf) When you duplicate the plot matrix on your local machine, you may want to add a axisLabels = ‘internal’ argument to your ggpairs function call to have the variable names on the diagonal of the matrix rather than on the outside.
# install these if necessary
#install.packages('GGally') # for multiple particular plot
#install.packages('scales') # for a variey of things
#install.packages('memisc') # to summarize regression
#install.packages('lattice') #
#install.packages('MASS') # for various functions
#install.packages('car') # to recode variables
#install.packages('reshape2') #
#install.packages('plyr') # to create interesting summaries and transmissions
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Warning: package 'memisc' was built under R version 3.4.2
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
##
## percent
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, #axisLabels = 'internal',
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What are some things you notice in the ggpairs output? Response:
Notes: Create two histograms of the price variable and place them side by side on one output image.
We’ve put some code below to get you started.
The first plot should be a histogram of price and the second plot should transform the price variable using log10.
Set appropriate bin widths for each plot. ggtitle() will add a title to each histogram.
You can self-assess your work with the plots in the solution video.
library(gridExtra)
plot1 <- ggplot(aes(x = price), data = diamonds) +
geom_histogram(binwidth = 100, fill = I('#099DD9')) +
ggtitle('Price')
plot2 <- ggplot(aes(x = price), data = diamonds) +
geom_histogram(binwidth = 0.01, fill = I('#F79420')) +
scale_x_log10() +
ggtitle('Price (log10)')
grid.arrange(plot1, plot2, ncol=2)
Notes: When looking at these plots, what do you notice? Think specifically about the two peaks in the transformed plot and how it relates to the demand for diamonds.
Answer: There are two peaks in log scale plot, while in the normal plot I cannot distingush shuch peask. ***
Basic Structure of a Function (https://www.youtube.com/watch?v=Z1wB1rHAYzQ&list=PLOU2XLYxmsIK9qQfztXeybpHvru-TrqAP)
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point() +
scale_y_continuous(trans = log10_trans()) +
ggtitle('Price (log10) by Carat')
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) +
geom_point(alpha = 1/2, size = 0.75, position = 'jitter') +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).
head(sort(table(diamonds$carat), decreasing = T))
##
## 0.3 0.31 1.01 0.7 0.32 1
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = T))
##
## 605 802 625 828 776 698
## 132 127 126 125 124 121
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).
Notes: What makes a diamond sparkle? (http://www.bluenile.com/diamonds/diamond-cut)
Does clarity affect the sparkle? (http://www.bluenile.com/diamonds/diamond-clarity)
What’s a jeweler’s loupe? (http://en.wikipedia.org/wiki/Loupe)
Adjust the code below to color the points by clarity.
A layer called scale_color_brewer() has been added to adjust the legend and provide custom colors.
See if you can figure out what it does. Links to resources are in the Instructor Notes.
You will need to install the package RColorBrewer in R to get the same colors and color palettes.
# install and load the RColorBrewer package
#install.packages('RColorBrewer', dependencies = TRUE)
library(RColorBrewer)
ggplot2: scale_colour_brewer ggplot2: Color Brewer Palettes and Safe Colors/#palettes-color-brewer) ggplot2: Legends/)
override.aes() change legend symols to be larger or more be transparent than the plot’s ones.
ggplot(aes(x = carat, y = price, colour = clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).
Response:
Alter the code below.
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and cut')
## Warning: Removed 1696 rows containing missing values (geom_point).
Response:
Alter the code below.
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color', reverse = FALSE,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).
Response:
Notes:
Response:
Notes: Linear Models and Operators in R (http://data.princeton.edu/R/linearModels.html)
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
#mtable(m1, m2, m3, m4, m5)
mtable(m1, m2, m3, m4, m5, sdigits = 3)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamonds)
##
## ============================================================================================
## m1 m2 m3 m4 m5
## --------------------------------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
## (0.006) (0.019) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
## (0.007) (0.032) (0.031) (0.028) (0.016)
## carat -1.137*** -1.163*** -0.992*** -1.093***
## (0.012) (0.011) (0.010) (0.006)
## cut: .L 0.224*** 0.224*** 0.120***
## (0.004) (0.004) (0.002)
## cut: .Q -0.062*** -0.062*** -0.031***
## (0.004) (0.003) (0.002)
## cut: .C 0.051*** 0.052*** 0.014***
## (0.003) (0.003) (0.002)
## cut: ^4 0.018*** 0.018*** -0.002
## (0.003) (0.002) (0.001)
## color: .L -0.373*** -0.441***
## (0.003) (0.002)
## color: .Q -0.129*** -0.093***
## (0.003) (0.002)
## color: .C 0.001 -0.013***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.016*** -0.003*
## (0.003) (0.001)
## color: ^6 -0.023*** 0.001
## (0.002) (0.001)
## clarity: .L 0.907***
## (0.003)
## clarity: .Q -0.240***
## (0.003)
## clarity: .C 0.131***
## (0.003)
## clarity: ^4 -0.063***
## (0.002)
## clarity: ^5 0.026***
## (0.002)
## clarity: ^6 -0.002
## (0.002)
## clarity: ^7 0.032***
## (0.001)
## --------------------------------------------------------------------------------------------
## R-squared 0.924 0.935 0.939 0.951 0.984
## adj. R-squared 0.924 0.935 0.939 0.951 0.984
## sigma 0.280 0.259 0.250 0.224 0.129
## F 652012.063 387489.366 138654.523 87959.467 173791.084
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -7962.499 -3631.319 -1837.416 4235.240 34091.272
## Deviance 4242.831 3613.360 3380.837 2699.212 892.214
## AIC 15930.999 7270.637 3690.832 -8442.481 -68140.544
## BIC 15957.685 7306.220 3761.997 -8317.942 -67953.736
## N 53940 53940 53940 53940 53940
## ============================================================================================
Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.
Video Notes:
Let’s put our model in a larger context. Assuming that the data is not somehow corrupted and we are not egregiously violating some of the key assumptions of linear regression (for example, violating the IID assumption by having a bunch of duplicated observations in our data set), what could be some problems with this model? What else should we think about when using this model?
Take your time to answer this question, do some qualitative research about the diamond market. See the links below to get started.
Diamond Prices over the Years (http://www.pricescope.com/diamond-prices/diamond-prices-chart)
Global Diamond Report (http://www.bain.com/publications/articles/global-diamond-report-2013.aspx)
Falling Supply and Rising Demand: Couples in Shanghai take to the Ring (http://diamonds.blogs.com/diamonds_update/diamond-prices/)
If you’d like to learn more about linear models and how to interpret regression coefficients, please refer to the following articles.
Interpreting Regression Coefficients in R on R Bloggers (http://www.r-bloggers.com/interpreting-regression-coefficient-in-r/?utm_source=feedburner&utm_medium=email&utm_campaign=Feed%3A+RBloggers+%28R+bloggers%29)
Interpreting Regression Coefficients on the Analysis Factor blog (http://www.theanalysisfactor.com/interpreting-regression-coefficients/)
Fitting and Interpreting Linear Models by ŷhat (http://blog.yhat.com/posts/r-lm-summary.html)
Another Explanation of Factor Coefficients in Linear Models on Stats StackExchange http://stats.stackexchange.com/a/24256)
Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)
Response:
Notes: Your task is to build five linear models like Solomon did for the diamonds data set only this time you’ll use a sample of diamonds from the diamonds big data set.
Be sure to make use of the same variables (logprice, carat, etc.) and model names (m1, m2, m3, m4, m5).
To get the diamonds big data into RStudio on your machine, copy, paste, and run the code in the Instructor Notes. There’s 598,024 diamonds in this data set!
Since the data set is so large, you are going to use a sample of the data set to compute the models. You can use the entire data set on your machine which will produce slightly different coefficients and statistics for the models.
Your task is to write the code to create the models.
#install.packages('bitops')
#install.packages('RCurl')
#library('bitops')
#library('RCurl')
#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
load("BigDiamonds.rda")
diamondsbig$logprice <- log(diamondsbig$price)
m1 <- lm(logprice ~I(carat^(1/3)),
data = diamondsbig[diamondsbig$price < 10000 &
diamondsbig$cert == "GIA",])
#m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsbig)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = logprice ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color,
## data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert ==
## "GIA", ])
## m5: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert ==
## "GIA", ])
##
## ================================================================================================
## m1 m2 m3 m4 m5
## ------------------------------------------------------------------------------------------------
## (Intercept) 2.671*** 1.333*** 0.949*** 0.529*** -0.464***
## (0.003) (0.012) (0.012) (0.010) (0.009)
## I(carat^(1/3)) 5.839*** 8.243*** 8.633*** 8.110*** 8.320***
## (0.004) (0.022) (0.021) (0.017) (0.012)
## carat -1.061*** -1.223*** -0.782*** -0.763***
## (0.009) (0.009) (0.007) (0.005)
## cut: V.Good 0.120*** 0.090*** 0.071***
## (0.002) (0.001) (0.001)
## cut: Ideal 0.211*** 0.181*** 0.131***
## (0.002) (0.001) (0.001)
## color: K/L 0.123*** 0.117***
## (0.004) (0.003)
## color: J/L 0.312*** 0.318***
## (0.003) (0.002)
## color: I/L 0.451*** 0.469***
## (0.003) (0.002)
## color: H/L 0.569*** 0.602***
## (0.003) (0.002)
## color: G/L 0.633*** 0.665***
## (0.003) (0.002)
## color: F/L 0.687*** 0.723***
## (0.003) (0.002)
## color: E/L 0.729*** 0.756***
## (0.003) (0.002)
## color: D/L 0.812*** 0.827***
## (0.003) (0.002)
## clarity: I1 0.301***
## (0.006)
## clarity: SI2 0.607***
## (0.006)
## clarity: SI1 0.727***
## (0.006)
## clarity: VS2 0.836***
## (0.006)
## clarity: VS1 0.891***
## (0.006)
## clarity: VVS2 0.935***
## (0.006)
## clarity: VVS1 0.995***
## (0.006)
## clarity: IF 1.052***
## (0.006)
## ------------------------------------------------------------------------------------------------
## R-squared 0.888 0.892 0.899 0.937 0.969
## adj. R-squared 0.888 0.892 0.899 0.937 0.969
## sigma 0.289 0.284 0.275 0.216 0.154
## F 2700903.714 1406538.330 754405.425 423311.488 521161.443
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -60137.791 -53996.269 -43339.818 37830.414 154124.270
## Deviance 28298.689 27291.534 25628.285 15874.910 7992.720
## AIC 120281.582 108000.539 86691.636 -75632.827 -308204.540
## BIC 120313.783 108043.473 86756.037 -75482.557 -307968.400
## N 338946 338946 338946 338946 338946
## ================================================================================================
The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data
Notes:
Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601
#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
exp(modelEstimate)
## fit lwr upr
## 1 5040.436 3730.34 6810.638
The prediction interval here may be slightly conservative, as the model errors are heteroskedastic over carat (and hence price) even after our log and cube-root transformations.
See the output of the following code.
dat = data.frame(m4$model, m4$residuals)
with(dat, sd(m4.residuals))
## [1] 0.2164168
with(subset(dat, carat > .9 & carat < 1.1), sd(m4.residuals))
## [1] 0.2178147
dat$resid <- as.numeric(dat$m4.residuals)
ggplot(aes(y = resid, x = round(carat, 2)), data = dat) +
geom_line(stat = "summary", fun.y = sd)
## Warning: Removed 1 rows containing missing values (geom_path).
How could we do better? If we care most about diamonds with carat weights between 0.50 and 1.50, we might restrict the data we use to fit our model to diamonds that are that size - we have enough data.
Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.
Notes: Tiffany vs. Costco: Which Diamond Ring is Better (http://www.businessweek.com/articles/2013-05-06/tiffany-vs-dot-costco-which-diamond-ring-is-better)
But Costco Sells Pricy Diamonds Too (http://www.costco.com/CatalogSearch?catalogId=10701&langId=-1&keyword=Engagement&storeId=10301&refine=30108%2b209%2b238%2b210%2b) ***
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!